11/21/2020

Baseball

Slide with Bullets

  • We originally wanted to set up a scouting report for every batter in the league, using some custom statistics derived from advanced batting statistics. We would then produce a heatmap and a detailed scouting report showing what sorts of pitches work on each batter, and where in the zone they work. Gleefully, we set about finding our data.
  • Our data exists, but isn’t cheap. Back to the drawing board.
  • Instead, we decided to see if we could predict the net number of wins a pitcher was worth based on his ERA (earned run average), OBP (opponent on-base percentage), and whatever else we could think of. We found a package called Lahman that would do the job.

Slide with R Output

  • Here, we’re going to look at how we selected and in some cases created the column data we needed.
#Here, we look at the relevant data from Lahman

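#Load the packages used below (Lahman supplies the Pitching dataset)
library(Lahman)
library(dplyr)

#Select columns from the Pitching dataset
pitchers <- select(as_tibble(Pitching), playerID, yearID, teamID, IPouts, BB, SO, BAOpp, ERA, W, L)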


#Create a Net Wins column
pitchers <- pitchers %>% mutate(NetWins = W-L)

#Only keep rows where there is no missing data
pitchers <- pitchers[complete.cases(pitchers),]

#Standardize predictors (z-scores) so that coefficients are comparable across variables
pitchers <- pitchers %>% mutate(normIPouts = (IPouts - mean(IPouts)) / sd(IPouts))
pitchers <- pitchers %>% mutate(normBB = (BB - mean(BB)) / sd(BB))
pitchers <- pitchers %>% mutate(normSO = (SO - mean(SO)) / sd(SO))
pitchers <- pitchers %>% mutate(normBAOpp = (BAOpp - mean(BAOpp)) / sd(BAOpp))
pitchers <- pitchers %>% mutate(normERA = (ERA - mean(ERA)) / sd(ERA))
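The five lines above all apply the same z-score transform, (x - mean(x)) / sd(x). As a minimal sketch, base R's `scale()` gives the same result in one call. The toy data frame below is invented for illustration and is not taken from Lahman.

```r
# Toy stand-in for the pitchers table (values are invented for illustration)
toy <- data.frame(
  IPouts = c(600, 450, 120, 90, 300),
  BB     = c(60, 45, 20, 10, 35),
  SO     = c(200, 150, 30, 25, 90)
)

# Manual z-score, as in the slide: subtract the mean, divide by the sd
manual <- (toy$BB - mean(toy$BB)) / sd(toy$BB)

# scale() centers and scales every column in one call and returns a matrix
z <- scale(toy)
colnames(z) <- paste0("norm", colnames(z))

# Both routes agree, and each standardized column has mean 0 and sd 1
all.equal(manual, as.numeric(z[, "normBB"]))
round(colMeans(z), 10)
apply(z, 2, sd)
```

After standardizing, each coefficient reads as the change in net wins per one standard deviation of that predictor, which is what makes them comparable.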

Building linear model

  • Finally, we can build our model.

#Build Linear Model
mylm <- lm(NetWins ~ normIPouts + normBB + normSO + normBAOpp + normERA, data = pitchers)

#Analyze Findings
summary(mylm)
## 
## Call:
## lm(formula = NetWins ~ normIPouts + normBB + normSO + normBAOpp + 
##     normERA, data = pitchers)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.0234  -1.4264   0.3797   1.3818  22.3251 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0002319  0.0158333   0.015   0.9883    
## normIPouts   1.3883292  0.0431626  32.165  < 2e-16 ***
## normBB      -1.8568112  0.0353850 -52.475  < 2e-16 ***
## normSO       1.2529491  0.0316495  39.588  < 2e-16 ***
## normBAOpp   -0.0398335  0.0159113  -2.503   0.0123 *  
## normERA     -0.1265619  0.0163386  -7.746 9.68e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.288 on 43110 degrees of freedom
## Multiple R-squared:  0.1389, Adjusted R-squared:  0.1388 
## F-statistic:  1391 on 5 and 43110 DF,  p-value: < 2.2e-16
summary(mylm)$r.squared 
## [1] 0.1389159

Building a Generalized Additive Model

  • That last model sucked: it explains only about 14% of the variation in net wins. Let’s try again with a better model.

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## NetWins ~ normIPouts + normBB + normSO + normBAOpp + normERA
## 
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0002319  0.0158333   0.015   0.9883    
## normIPouts   1.3883292  0.0431626  32.165  < 2e-16 ***
## normBB      -1.8568112  0.0353850 -52.475  < 2e-16 ***
## normSO       1.2529491  0.0316495  39.588  < 2e-16 ***
## normBAOpp   -0.0398335  0.0159113  -2.503   0.0123 *  
## normERA     -0.1265619  0.0163386  -7.746 9.68e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## R-sq.(adj) =  0.139   Deviance explained = 13.9%
## GCV =  10.81  Scale est. = 10.809    n = 43116
## NULL
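The formula in the output above has no smooth terms (nothing wrapped in `s()`), so the additive model reduces to the same linear fit as before, which would explain the identical coefficients and R-squared. The output format matches `summary()` of an `mgcv::gam` fit, but the call itself is not shown, so treating it as mgcv is an assumption. A minimal sketch on simulated data (not the Lahman data) of how `s()` lets the fit bend:

```r
library(mgcv)  # recommended package, bundled with most R installations

set.seed(1)
# Simulated data with a clearly nonlinear signal (purely illustrative)
toy <- data.frame(x = runif(300, 0, 2 * pi))
toy$y <- sin(toy$x) + rnorm(300, sd = 0.2)

# Without s(), gam() fits an ordinary straight line, just like lm()
fit_linear <- gam(y ~ x, data = toy)
fit_lm     <- lm(y ~ x, data = toy)

# With s(), gam() estimates a smooth curve for x
fit_smooth <- gam(y ~ s(x), data = toy)

# The unsmoothed gam and lm agree on the coefficients
all.equal(unname(coef(fit_linear)), unname(coef(fit_lm)))

# Adjusted R-squared of the straight-line fit versus the smooth fit
summary(fit_linear)$r.sq
summary(fit_smooth)$r.sq
```

On the pitchers table, the analogous formula would wrap each predictor, for example `NetWins ~ s(normIPouts) + s(normBB)`. Whether that helps on the real data is not something this sketch can tell us.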

Building a Generalized Linear Model

  • That last model sucked too: same coefficients, same 13.9% of deviance explained. Let’s try another one, a generalized linear model.

## 
## Call:
## glm(formula = NetWins ~ normIPouts + normBB + normSO + normBAOpp + 
##     normERA, family = gaussian, data = pitchers)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -22.0234   -1.4264    0.3797    1.3818   22.3251  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0002319  0.0158333   0.015   0.9883    
## normIPouts   1.3883292  0.0431626  32.165  < 2e-16 ***
## normBB      -1.8568112  0.0353850 -52.475  < 2e-16 ***
## normSO       1.2529491  0.0316495  39.588  < 2e-16 ***
## normBAOpp   -0.0398335  0.0159113  -2.503   0.0123 *  
## normERA     -0.1265619  0.0163386  -7.746 9.68e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 10.80891)
## 
##     Null deviance: 541146  on 43115  degrees of freedom
## Residual deviance: 465972  on 43110  degrees of freedom
## AIC: 224998
## 
## Number of Fisher Scoring iterations: 2
## NULL
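The coefficients above match the linear model digit for digit. That is expected: a GLM with `family = gaussian` and the default identity link is ordinary least squares, so it cannot do better than `lm`. The share of deviance it explains can be worked out from the two deviances reported above. A small sketch, with simulated data standing in for the pitchers table:

```r
# Deviance explained, using the null and residual deviances reported above
dev_explained <- 1 - 465972 / 541146
round(dev_explained, 4)   # about 0.1389, in line with the lm R-squared

set.seed(42)
# Simulated predictors and response (illustrative only, not Lahman data)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$y <- 1.5 * toy$x1 - 0.5 * toy$x2 + rnorm(200)

fit_lm  <- lm(y ~ x1 + x2, data = toy)
fit_glm <- glm(y ~ x1 + x2, family = gaussian, data = toy)

# Both fitting routines return the same coefficient estimates
all.equal(coef(fit_lm), coef(fit_glm))

# For a gaussian GLM, 1 - deviance / null.deviance equals the lm R-squared
all.equal(1 - fit_glm$deviance / fit_glm$null.deviance,
          summary(fit_lm)$r.squared)
```

Switching model families only changes the answer when the family or link actually differs from the gaussian/identity default.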

Building models

  • Wow, that went well. Let’s see if we can find any model that works at all, even one that is nonlinear and completely unintuitive to normal humans.

  • We’re going to throw everything at it. We are inevitable.

  • Put on that infinity glove and snap your fingers.

After the MCU

  • Well, look at that: we snapped our fingers and explained just over half of the variation. Excellent work, team!

  • Clearly, this isn’t going as well as it might have. We decided to look at a crossplot to see what sorts of correlations exist among the variables.
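One thing a crossplot can reveal is predictors that track each other closely, which makes individual coefficients hard to interpret. As a rough sketch, a correlation matrix gives the same information numerically. The data below are simulated so that two columns depend on a shared workload variable; the column names mirror the slides, but the values are invented.

```r
set.seed(7)
# Simulated season lines: walks and strikeouts both scale with outs recorded
ip_outs <- round(runif(500, 30, 700))
toy <- data.frame(
  IPouts = ip_outs,
  BB     = rpois(500, lambda = ip_outs * 0.11),
  SO     = rpois(500, lambda = ip_outs * 0.25),
  ERA    = rnorm(500, mean = 4.2, sd = 1.1)
)

# Pairwise Pearson correlations between all columns
cor_mat <- cor(toy)
round(cor_mat, 2)

# Base R's scatterplot matrix draws the matching crossplot
# pairs(toy)
```

In this simulation the counting columns come out strongly correlated by construction. Running `cor()` on the real pitchers columns would show whether the same holds there.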

Crossplot

Interesting Things

To me, the most interesting thing was strikeouts by OBP, so I made an interactive graph.

Interactive OBP x Strikeouts

Another interactive graph

Another interactive graph

Advanced pitching metrics

Slide with Plot